Using the Gene Ontology to generate features for protein interaction prediction was used by Qi. For each annotation a gene has in the Gene Ontology database it can be compared with the other gene in it's pair and if they share it the feature is 1, otherwise it is 0. To do this processing, however, first the annotations for each gene must be obtained from databases online. It appears that the Entrez gene database's original paper reports that it contains Gene ontology information so all that's required is to use Biopython to query this database.

Annotating Gene IDs with Biopython

The following code is taken from the Biopython tutorial and cookbook.


In [2]:
import sys
 
from Bio import Entrez
 
# *Always* tell NCBI who you are
Entrez.email = "gavingray1729@gmail.com"
 
def retrieve_annotation(id_list):
 
    """Annotates Entrez Gene IDs using Bio.Entrez, in particular epost (to
    submit the data to NCBI) and esummary to retrieve the information. 
    Returns a list of dictionaries with the annotations."""
 
    request = Entrez.epost("gene",id=",".join(id_list))
    try:
        result = Entrez.read(request)
    except RuntimeError as e:
        #FIXME: How generate NAs instead of causing an error with invalid IDs?
        print "An error occurred while retrieving the annotations."
        print "The error returned was %s" % e
        sys.exit(-1)
 
    webEnv = result["WebEnv"]
    queryKey = result["QueryKey"]
    data = Entrez.esummary(db="gene", webenv=webEnv, query_key =
            queryKey)
    annotations = Entrez.read(data)
 
    print "Retrieved %d annotations for %d genes" % (len(annotations),
            len(id_list))
 
    return annotations

Trying this out with some arbitrary gene ID:


In [3]:
cd /home/gavin/Documents/MRes/DIP/human/


/home/gavin/Documents/MRes/DIP/human

In [4]:
import csv

In [5]:
ls


DIPtouniprot.tab  flat.Entrez.txt   Hsapi20140427.txt    interacting.Entrez.txt   training.negative.Entrez.txt          training.nolabel.positive.Entrez.txt  uniprottoEntrez.tab
flat.DIP.txt      flat.uniprot.txt  interacting.DIP.txt  interacting.uniprot.txt  training.nolabel.negative.Entrez.txt  training.positive.Entrez.txt

In [6]:
dipids = list(flatten(csv.reader(open("flat.Entrez.txt"))))

Testing for the first of these:


In [7]:
annotation1 = retrieve_annotation(dipids[0])


Retrieved 4 annotations for 4 genes

Ok, so why is it retrieving for annotations for a single gene identifier? Surely it should only retrieve one annotation? Looking at these annotations using another function from the cookbook:


In [8]:
def print_data(annotation):
    for gene_data in annotation:
        gene_id = gene_data["Id"]
        gene_symbol = gene_data["NomenclatureSymbol"]
        gene_name = gene_data["Description"]
        print "ID: %s - Gene Symbol: %s - Gene Name: %s" % (gene_id, gene_symbol, gene_name)

In [9]:
print_data(annotation1)


ID: 6 - Gene Symbol:  - Gene Name: adenovirus-12 chromosome modification site 1B
ID: 3 - Gene Symbol: A2MP1 - Gene Name: alpha-2-macroglobulin pseudogene 1
ID: 6 - Gene Symbol:  - Gene Name: adenovirus-12 chromosome modification site 1B
ID: 7 - Gene Symbol:  - Gene Name: adenovirus-12 chromosome modification site 17

Looks like it's not expecting a string, maybe it expects a list?


In [10]:
annotation1 = retrieve_annotation([dipids[0]])


Retrieved 1 annotations for 1 genes

In [11]:
print_data(annotation1)


ID: 6367 - Gene Symbol: CCL22 - Gene Name: chemokine (C-C motif) ligand 22

Writing my own function to print out everything nicely:


In [12]:
def print_data_all(annotation):
    for a in annotation:
        for k in a.keys():
            print k+" : "+"%s"%a[k]
    return None

In [13]:
print_data_all(annotation1)


Mim : [602957]
Orgname : Homo sapiens
TaxID : 9606
GeneWeight : 11556
GeneticSource : genomic
ChrSort : 16
Status : 0
OtherDesignations : C-C motif chemokine 22|CC chemokine STCP-1|MDC(1-69)|macrophage-derived chemokine|small inducible cytokine A22|small inducible cytokine subfamily A (Cys-Cys), member 22|small-inducible cytokine A22|stimulated T cell chemotactic protein 1|stimulated T-cell chemotactic protein 1
Description : chemokine (C-C motif) ligand 22
CurrentID : 0
GenomicInfo : [{'ChrAccVer': 'NC_000016.10', 'ChrLoc': '16', 'ExonCount': 3, 'ChrStop': 57366189, 'ChrStart': 57358782}]
NomenclatureName : chemokine (C-C motif) ligand 22
NomenclatureSymbol : CCL22
Name : CCL22
Summary : This gene is one of several Cys-Cys (CC) cytokine genes clustered on the q arm of chromosome 16. Cytokines are a family of secreted proteins involved in immunoregulatory and inflammatory processes. The CC cytokines are proteins characterized by two adjacent cysteines. The cytokine encoded by this gene displays chemotactic activity for monocytes, dendritic cells, natural killer cells and for chronically activated T lymphocytes. It also displays a mild activity for primary activated T lymphocytes and has no chemoattractant activity for neutrophils, eosinophils and resting T lymphocytes. The product of this gene binds to chemokine receptor CCR4. This chemokine may play a role in the trafficking of activated T lymphocytes to inflammatory sites and other aspects of activated T lymphocyte physiology. [provided by RefSeq, Jul 2008]
Item : []
ChrStart : 57358782
NomenclatureStatus : Official
OtherAliases : A-152E5.1, ABCD-1, DC/B-CK, MDC, SCYA22, STCP-1
MapLocation : 16q13
Id : 6367
Chromosome : 16

Interesting, but don't see anything called "Gene Ontology" unfortunately. Must be integrated into here somewhere but I don't really understand where.


Mapping to Gene Ontology identifiers

Using a flat file from NBCI on this ftp server can map an Entrez ID to a number of GO IDs:


In [14]:
cd /home/gavin/Documents/MRes/geneconversion/


/home/gavin/Documents/MRes/geneconversion

In [15]:
# read through this file and make a list of GO IDs for the example gene chosen above
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
examplelist=[]
for l in c:
    if l[1]==dipids[0]:
        examplelist.append(l[2])

In [16]:
print examplelist


['GO:0005576', 'GO:0005615', 'GO:0006935', 'GO:0006954', 'GO:0006955', 'GO:0007165', 'GO:0007267', 'GO:0008009', 'GO:0009615', 'GO:0060326', 'GO:0060326']

In [17]:
import goatools

Then we can parse the obo file obtained from here.


In [18]:
cd /home/gavin/Documents/MRes/geneontology/


/home/gavin/Documents/MRes/geneontology

In [19]:
ls


gene_ontology.1_2.obo

In [25]:
parsedobo = goatools.obo_parser.GODag('gene_ontology.1_2.obo')


load obo file gene_ontology.1_2.obo
42995 nodes imported

In [33]:
for i in examplelist:
    print parsedobo[i].namespace
    print parsedobo[i].name
    print " "


cellular_component
extracellular region
 
cellular_component
extracellular space
 
biological_process
chemotaxis
 
biological_process
inflammatory response
 
biological_process
immune response
 
biological_process
signal transduction
 
biological_process
cell-cell signaling
 
molecular_function
chemokine activity
 
biological_process
response to virus
 
biological_process
cell chemotaxis
 
biological_process
cell chemotaxis
 

In [29]:
dir(parsedobo[id1])


Out[29]:
['__doc__',
 '__init__',
 '__module__',
 '__repr__',
 '__str__',
 '_parents',
 'alt_ids',
 'children',
 'get_all_child_edges',
 'get_all_children',
 'get_all_parent_edges',
 'get_all_parents',
 'has_child',
 'has_parent',
 'id',
 'is_obsolete',
 'level',
 'name',
 'namespace',
 'parents']

So I can retrieve these annotations for arbitrary Entrez IDs. Should then be able to generate the database from this conversion table, using all the gene IDs available in it. I guess the question then is, how many IDs are in it?


In [35]:
cd /home/gavin/Documents/MRes/geneconversion/


/home/gavin/Documents/MRes/geneconversion

In [36]:
# read through this file and make a list of GO IDs for the example gene chosen above
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
#take second column and make it a set
EntrezIDs = set(zip(*list(c))[1])

In [41]:
import scipy.misc

In [43]:
print "Number of Entrez IDs in the gene2go table: %i"%(len(EntrezIDs))
print "Number of binary combinations of these genes: %i"%(scipy.misc.comb(len(EntrezIDs),2))


Number of Entrez IDs in the gene2go table: 215328
Number of binary combinations of these genes: 23182966128

So the resulting database will have 23 billion keys? Could cause problems, but will continue and see if it does.